Fix for race condition in node-join/node-left loop #15521

rahulkarajgikar · 2024-08-30T06:50:35Z

Description

Fix for race condition in node-join/node-left loop.

Scenario where race condition can happen:

Suppose a node disconnects from the cluster due to some normal reason.
This queues a node-left task on cluster manager thread.
The cluster manager then computes the new cluster state based on the node-left task.
The cluster manager now tries to send the new state to all the nodes and waits for all nodes to ack back.
Suppose this takes a long time due to lagging nodes or slow applying of the state or any other reason.
While this is happening, the node that just left sends a join request to the cluster manager to rejoin the cluster. [This happens in an endless retry loop on the transport layer; the frequency is controlled by the discovery.find_peers_interval setting.]
The role of this join request is to re-establish any required connections and do some pre-validations before queuing a new task.
After the join request is validated by the cluster manager node, the cluster manager queues a node-join task on its thread.
This node-join task starts only after the node-left task has completed, since the cluster manager thread is single-threaded.
Now suppose the node-left task has completed publication and has started to apply the new state on the cluster manager.
As part of applying the cluster state of node-left task, cluster manager wipes out the connection to the leaving node.
The node-left task then completes and the node-join task begins.
The node-join task assumes that, because the previous join request succeeded, the connection to the joining node is still there.
So then the cluster manager computes the new state.
Then it tells the FollowersChecker thread to add this new node.
Then it tries to publish the new state to all the nodes.
However, at this point, the FollowersChecker thread fails with NodeNotConnectedException because the connection was wiped, and it triggers a new node-left task.
If the new node-left task also takes time, we end up in an infinite loop of node-left and node-join tasks.
Even if the FollowersChecker is modified to handle this NodeNotConnectedException gracefully without triggering a node-left task, the state publication to the joining node still fails because the connection was wiped. So the node-join state is never applied on the joining node, and it remains in the candidate phase forever.
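
The ordering problem above can be reproduced in a small, self-contained model. This is only an illustrative sketch: the class name, the event strings, and the latches that stand in for "slow publication" are all hypothetical and are not OpenSearch code. It assumes a single-threaded executor in place of the cluster manager thread and a plain set in place of the connection table.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the race; none of these names are OpenSearch APIs.
class NodeLeftJoinRaceSketch {

    static List<String> simulate() {
        Set<String> connections = ConcurrentHashMap.newKeySet();
        List<String> events = new CopyOnWriteArrayList<>();
        // Stands in for the single-threaded cluster manager task queue.
        ExecutorService clusterManagerThread = Executors.newSingleThreadExecutor();
        CountDownLatch publicationStarted = new CountDownLatch(1);
        CountDownLatch publicationAcked = new CountDownLatch(1);
        String node = "node-1";
        try {
            // node-left task: publication is slow, then the apply step wipes the connection.
            clusterManagerThread.execute(() -> {
                events.add("node-left: publishing");
                publicationStarted.countDown();
                awaitQuietly(publicationAcked);
                connections.remove(node);
                events.add("node-left: applied, connection wiped");
            });

            // Join request arrives while node-left is still publishing.
            awaitQuietly(publicationStarted);
            connections.add(node);
            events.add("join request: validated, connection established");

            // node-join task is queued behind node-left and assumes the connection survives.
            clusterManagerThread.execute(() -> events.add(
                connections.contains(node)
                    ? "node-join: published to " + node
                    : "node-join: failed, " + node + " not connected"));

            // Lagging nodes finally ack, so node-left can finish.
            publicationAcked.countDown();
            clusterManagerThread.shutdown();
            if (!clusterManagerThread.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("cluster manager tasks did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            clusterManagerThread.shutdownNow();
        }
        return events;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

In this model the node-join task always observes a missing connection, because the connection created by the join request is removed by the node-left task that was queued ahead of it.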

To summarise, if we allow a node-join task into the queue before the node-left task disconnects from the node, we will see the race condition happen.

Fix:

As part of the fix for this, we now reject the initial join request from a node that has an ongoing node-left task.
The join request succeeds only after the node-left task has finished committing state on the cluster manager, so the connection created as part of the join request is not wiped out and the node-join task does not fail.

This is done by marking nodes as pending disconnect right before the node-left task publishes its state.
We mark the nodes as disconnect-completed after the node-left task commits its state, or on re-election of the cluster manager.

If the cluster manager receives a request to connect to a node that is pending disconnect during this time, we reject the connection request. This blocks join requests, because while handling a join request the cluster manager tries to connect to the node that is trying to join.
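
The bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in ClusterConnectionManager: the class name, method names, exception type, and the use of string node ids are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of pending-disconnect bookkeeping; not the actual OpenSearch signatures.
class PendingDisconnectSketch {
    private final Set<String> connected = ConcurrentHashMap.newKeySet();
    private final Set<String> pendingDisconnect = ConcurrentHashMap.newKeySet();

    // Called right before the node-left state is published.
    void markPendingDisconnect(String nodeId) {
        pendingDisconnect.add(nodeId);
    }

    // Called after the node-left state is committed, or on cluster manager re-election.
    void markDisconnectCompleted(String nodeId) {
        pendingDisconnect.remove(nodeId);
    }

    // Rejects new connections to nodes whose node-left task is still in flight.
    void connectToNode(String nodeId) {
        if (pendingDisconnect.contains(nodeId)) {
            throw new IllegalStateException("blocked connection to [" + nodeId + "]: pending disconnect");
        }
        connected.add(nodeId);
    }

    void disconnectFromNode(String nodeId) {
        connected.remove(nodeId);
    }

    boolean isConnected(String nodeId) {
        return connected.contains(nodeId);
    }
}
```

Because the join handler has to connect to the joining node first, a rejected connection means no node-join task is queued while the node-left task is still running.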

The join request will keep retrying, and once the node-left succeeds, the join request will be able to make a connection and succeed.
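
The retry behaviour can be pictured with a small loop. This is a hypothetical sketch: the helper name, the attempt-numbered predicate, and the attempt limit are assumptions made for illustration, and the real retry is driven by the peer-finding loop rather than by a bounded counter.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of a join request that is retried until the connection is accepted.
class JoinRetrySketch {

    // Returns the attempt number on which the join succeeded, or -1 if maxAttempts ran out.
    static int retryJoin(IntPredicate tryJoin, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryJoin.test(attempt)) {
                return attempt;
            }
            // A real node would wait for the configured retry interval here before trying again.
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assume the node-left task finishes after the third rejected attempt.
        int succeededOn = retryJoin(attempt -> attempt > 3, 10);
        System.out.println("join succeeded on attempt " + succeededOn);
    }
}
```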

Main classes:

Coordinator - where publication begins
ClusterConnectionManager - low-level connection class. Holds the connection logic and book-keeping. Used by both TransportService and NodeConnectionsService. The core changes are made here.
ClusterApplierService - entry point of node connects/disconnects in cluster state commit flow.
NodeConnectionsService - abstraction used only by ClusterApplierService to handle connections/disconnections.
TransportService - called by Coordinator and NodeConnectionsService to connect/disconnect.
NodeJoinLeftIT - integration test that simulates the issue.

Related Issues

Resolves #4874

Check List

Functionality includes testing.
[N/A] API changes companion pull request created, if applicable.
[N/A] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2024-08-30T06:59:30Z

❌ Gradle check result for 9496aa1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-08-30T08:45:30Z

❌ Gradle check result for 94ff71b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-08-30T14:56:38Z

❌ Gradle check result for 78b6fdd: ABORTED

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java

server/src/main/java/org/opensearch/cluster/NodeConnectionsService.java

github-actions · 2024-09-02T05:35:42Z

❌ Gradle check result for 8411788: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-02T08:07:34Z

❌ Gradle check result for f0cf40c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

server/src/main/java/org/opensearch/transport/TransportService.java

server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java

server/src/main/java/org/opensearch/transport/ConnectionManager.java

server/src/main/java/org/opensearch/transport/RemoteConnectionManager.java

server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java

server/src/main/java/org/opensearch/cluster/NodeConnectionsService.java

server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java

github-actions · 2024-09-03T07:49:39Z

❌ Gradle check result for a1db9fd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-03T09:44:55Z

❌ Gradle check result for 61eac52: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-04T05:58:32Z

❌ Gradle check result for dd14cc8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

rahulkarajgikar · 2024-09-04T06:13:31Z

rebased main

github-actions · 2024-09-04T06:50:32Z

❌ Gradle check result for 4a46aa6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-04T07:56:41Z

❌ Gradle check result for 412eada: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-04T14:33:57Z

❌ Gradle check result for dd68f30: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-04T20:10:37Z

❌ Gradle check result for ca48626: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-05T06:10:05Z

❌ Gradle check result for 36e473d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-05T07:34:57Z

❌ Gradle check result for 9c788fb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-05T08:01:19Z

❌ Gradle check result for e02aa10: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java

server/src/main/java/org/opensearch/cluster/NodeConnectionsService.java

server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java

server/src/main/java/org/opensearch/transport/ConnectionManager.java

server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java

server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java

github-actions · 2024-09-06T09:59:36Z

❌ Gradle check result for b0a7ae3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-06T10:15:12Z

❌ Gradle check result for 0d3db12: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

rahulkarajgikar · 2024-09-06T10:22:01Z

force pushed to rebase from main

github-actions · 2024-09-06T11:19:23Z

❌ Gradle check result for 9e322c4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-06T13:39:31Z

❌ Gradle check result for 85a9e37: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rahul Karajgikar <[email protected]>

github-actions · 2024-09-25T12:06:31Z

❌ Gradle check result for 9a060cc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

rahulkarajgikar · 2024-09-25T12:24:16Z

https://build.ci.opensearch.org/job/gradle-check/48387/

org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/410_nested_aggs/Supported queries}
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/410_nested_aggs/Supported queries}
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/410_nested_aggs/Supported queries}
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/410_nested_aggs/Supported queries}

All 4 tests are flaky

Signed-off-by: Rahul Karajgikar <[email protected]>

github-actions · 2024-09-25T14:18:01Z

❌ Gradle check result for 7b0d28b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rahul Karajgikar <[email protected]>

github-actions · 2024-09-25T17:56:24Z

❕ Gradle check result for e0a0ae2: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

* Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]> (cherry picked from commit 1563e1a) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

(cherry picked from commit 1563e1a) Signed-off-by: Rahul Karajgikar <[email protected]>

…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>

rajiv-kv reviewed Aug 30, 2024

server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java

server/src/main/java/org/opensearch/cluster/NodeConnectionsService.java

rajiv-kv reviewed Sep 2, 2024

rahulkarajgikar force-pushed the race_condition_2 branch from 4a46aa6 to d7be8b7 Compare September 4, 2024 06:12

github-actions bot added bug Something isn't working Cluster Manager labels Sep 5, 2024

rahulkarajgikar force-pushed the race_condition_2 branch from 9c788fb to e02aa10 Compare September 5, 2024 07:25

rajiv-kv reviewed Sep 5, 2024

rahulkarajgikar force-pushed the race_condition_2 branch from 0d3db12 to 9e322c4 Compare September 6, 2024 10:21

opensearch-ci-bot mentioned this pull request Sep 6, 2024

[AUTOCUT] Gradle Check Flaky Test Report for PluginInfoIT #15814

Open

Rahul Karajgikar added 4 commits September 25, 2024 10:44

remove unused code

fff66e1

Signed-off-by: Rahul Karajgikar <[email protected]>

add assertions on exception message

db3d23a

Signed-off-by: Rahul Karajgikar <[email protected]>

changes to tests based on comments

a6a6c38

Signed-off-by: Rahul Karajgikar <[email protected]>

empty commit

1ded2cc

Signed-off-by: Rahul Karajgikar <[email protected]>

rahulkarajgikar force-pushed the race_condition_2 branch from a107371 to 1ded2cc Compare September 25, 2024 10:47

update nodeconnectionsservice test

9a060cc

Signed-off-by: Rahul Karajgikar <[email protected]>

rajiv-kv approved these changes Sep 25, 2024

empty commit for gradle check

7b0d28b

Signed-off-by: Rahul Karajgikar <[email protected]>

empty commit

e0a0ae2

Signed-off-by: Rahul Karajgikar <[email protected]>

This was referenced Sep 25, 2024

[AUTOCUT] Gradle Check Flaky Test Report for RecoveryFromGatewayIT #14304

Open

[AUTOCUT] Gradle Check Flaky Test Report for SpecificClusterManagerNodesIT #15944

Open

[AUTOCUT] Gradle Check Flaky Test Report for IndexServiceTests #14407

Open

shwetathareja approved these changes Sep 28, 2024

shwetathareja added the backport 2.x Backport to 2.x branch label Sep 28, 2024

shwetathareja merged commit 1563e1a into opensearch-project:main Sep 28, 2024
37 of 38 checks passed

opensearch-trigger-bot bot mentioned this pull request Sep 28, 2024

[Backport 2.x] Fix for race condition in node-join/node-left loop #16118

Merged

shwetathareja pushed a commit that referenced this pull request Sep 30, 2024

Fix for race condition in node-join/node-left loop (#15521) (#16118)

8f585d0

(cherry picked from commit 1563e1a) Signed-off-by: Rahul Karajgikar <[email protected]>

hainenber pushed a commit to hainenber/OpenSearch that referenced this pull request Oct 1, 2024

Fix for race condition in node-join/node-left loop (opensearch-projec…

6fa4fa8

…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>

ruai0511 pushed a commit to ruai0511/OpenSearch that referenced this pull request Oct 4, 2024

Fix for race condition in node-join/node-left loop (opensearch-projec…

44bc079

…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>

prudhvigodithi mentioned this pull request Oct 16, 2024

[AUTOCUT] Gradle Check Flaky Test Report for MixedClusterClientYamlTestSuiteIT prudhvigodithi/opensearch-build#82

Open

dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 16, 2024

Fix for race condition in node-join/node-left loop (opensearch-projec…

2f1457e

…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>

dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 17, 2024

Fix for race condition in node-join/node-left loop (opensearch-projec…

a75b0fc

…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>

dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 21, 2024

Fix for race condition in node-join/node-left loop (opensearch-projec…

7e3024e

…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>

opensearch-ci-bot mentioned this pull request Oct 23, 2024

[AUTOCUT] Gradle Check Flaky Test Report for UpdateByQueryBasicTests #16439

Open

BrewTestBot mentioned this pull request Nov 6, 2024

opensearch 2.18.0 Homebrew/homebrew-core#196785

Merged
